Executive Summary


The purpose of this report is to demonstrate the utility of an information-theoretic approach to next-generation chemical
detection.  Research  and  development  of  chemical  sensing  systems  that  are  effective  for  a  variety  of  field  environments 
outside  the  laboratory  is  an  ongoing  challenge.  Given  the  requirements  for  discerning  potentially  hundreds  to  thousands 
of  target  analytes  from  an  even  larger  number  of  background  contaminants,  sensors  capable  of  generating  multivariate 
data  -  such  as  spectral  sensing  systems  -  have  become  a  central  focus  for  new  system  designs  due  to  their  prior 
successes  in  the  laboratory. 

Spectral  data  capture  a  wealth  of  chemical  information  in  a  variety  of  sensing  modalities.  While  sharing  a  common 
multivariate  structure,  spectral  data  often  exhibit  characteristics  that  can  be  challenging  to  model.  Traditional  statistical 
methods  often  focus  on  typical  or  most  frequent  values  in  the  data.  But  spectral  data  tend  to  be  sparse,  characterized 
instead  by  large  values  in  peaks  that  occur  infrequently.  It  is  sparseness  that  gives  spectral  data  their  robustness  for 
chemical  identification. 

Information  theory  provides  an  alternate  framework  for  modeling  sparse  spectral  data.  Information  measures  are 
methods  that  can  quantify  how  effective  chemical  spectra  are  at  discriminating  between  hypotheses  such  as  potential 
target  analytes,  even  if  those  spectra  arise  from  different  sensing  modalities.  Providing  additional  flexibility,  information 
measures  operate  on  probability  distributions  that  can  be  developed  directly  from  data  themselves,  when  available,  or 
from  statistical  models  or  simulations  of  sensors  when  not. 

The  application  of  information  theory  to  problems  in  analytical  chemistry  is  not  new.  However,  its  inclusion  in 
chemometrics  for  modeling  large  scale  sets  of  multivariate  spectral  data  that  have  only  recently  become  available  is 
novel.  Recent  research  at  the  Naval  Research  Laboratory  (NRL)  has  yielded  probabilistic  models  for  spectral  data  that 
enable  the  computation  of  information  measures  such  as  entropy  and  divergence,  with  the  goal  of  developing  feature  sets 
to  increase  the  sensitivity  and  selectivity  of  multivariate  chemical  sensors  of  several  modalities.  Results  are  presented  for 
several  types  of  spectral  data  in  multisensor  systems,  as  well  as  strategies  for  using  information  measures  with  other  data 
sources. Binary, univariate, and multivariate sensors can all be modeled from an information-theoretic perspective,
which makes the approach well suited to the challenges of next-generation chemical detection.

Introduction


Accurate,  reliable,  and  consistently  effective  chemical  sensing  outside  the  laboratory  remains  a  challenge  for  the  diverse 
needs  of  the  joint  armed  forces.  Physical  requirements  for  the  ideal  sensor  vary  widely  among  the  services,  as  do 
performance  requirements.  Optimal  performance  is  also  needed  for  a  variety  of  environments  with  diverse  backgrounds 
and  potentially  contaminating  interferants  that  may  vary  widely  as  well1.  As  a  consequence,  next  generation  chemical 
detection  remains  an  area  of  active  research  and  development2. 

One  of  the  most  common  applications  for  chemical  sensing  in  the  field  (i.e.,  outside  the  laboratory)  is  distinguishing 
target  analytes  in  the  presence  of  background  chemical  compounds  that  may  be  entirely  unknown  or  differ  markedly 
across  the  many  environments  where  the  sensors  are  deployed.  In  the  laboratory,  multivariate  spectral  instruments  are 
often  relied  on  by  trained  analytical  chemists  for  their  ability  to  elucidate  chemical  compositions  of  samples  where  the 
chemical  constituency  may  be  entirely  unknown.  Adapting  spectral  instruments  for  field  sensing  applications  or 
autonomous  sample  analysis  often  means  replacing  the  trained  chemist  with  algorithms  that  compare  sample  spectra  with 
a  library  of  spectra  from  known  target  analytes  and  common  background  compounds.  Typically,  a  distance  metric  is 
used to quantify the separation between the sample spectrum and a candidate library spectrum. The candidate with the
minimum distance is selected, provided that it satisfies a threshold or match factor. Examples of this approach are used in the
industry  standard  NIST  software  for  mass  spectra  analysis3. 

Traditional  statistical  methods  for  data  modeling  and  analysis  have  focused  on  finding  the  most  likely  or  most  common 
value  in  fitting  data  to  a  probability  distribution.  For  many  sparse  data  types,  such  as  those  generated  in  mass  (MS)  and 
ion  mobility  (IMS)  spectrometry,  infrared  spectroscopy  (IR),  and  gas  chromatography  (GC)  data,  the  most  common 
values  do  not  typically  characterize  the  data.  Rather,  it  is  the  values  of  the  uncommon  or  infrequent  peaks  that  uniquely 
define  spectral  data.  Figure  1  shows  an  example  of  a  GC-MS  total  ion  chromatogram  (TIC)  of  a  Navy  diesel  fuel 
together  with  examples  of  spectral  data  for  cycloheptane.  Fitting  a  data  model  to  the  peaks  means  fitting  the  sparse  large 
values  in  the  tail  of  the  data  distribution,  which  is  challenging  as  they  lie  far  from  the  most  common  values  near  zero. 

A  further  challenge  for  autonomous  spectral  matching  is  that  the  relative  intensity  values  of  peaks  in  spectral  data  are  not 
generally  preserved  and  vary  strongly  with  concentration  of  the  analyte  in  the  sample.  Figure  2  shows  two  examples  of 
1,3,5-trimethylbenzene  spectra  obtained  with  a  laboratory  GC-MS  instrument  (left)  and  GC-IR  instrument  (right) 
overlaid with the corresponding intensity profile from the NIST and PNNL libraries, respectively. The relative peak
heights  between  the  sample  and  library  spectra  are  not  preserved.  In  Figure  2  (left),  the  maximum  peaks  appear  in 
different  m/z  bins,  even  though  the  general  profiles  of  the  two  spectra  are  quite  similar.  Peak  heights  increase  with 
increasing  analyte  concentration,  but  saturate  at  a  maximum  value  in  most  spectral  instruments.  In  Figure  2  (right),  the 
IR  spectra  are  shifted  with  respect  to  each  other,  and  this  shift  is  non-linear  with  respect  to  bin  number. 

Information  theory4  provides  content-agnostic  methods  that  can  be  used  to  characterize  sparse  spectral  data  and 
determine  its  most  significant  components.  These  tools  can  quantify  how  effective  sensor  spectra  are  at  discriminating 
between  hypotheses  (information  divergence),  how  informative  a  measurement  is  expected  to  be  (entropy),  how 
informative  a  particular  outcome  of  that  measurement  is  (self-information),  and  how  closely  associated  two  outcomes  are 
(pointwise  mutual  information).  As  a  group,  these  tools  are  information  measures  that  operate  on  probability 
distributions  that  can  be  developed  directly  from  the  data  or  from  statistical  models.  In  addition,  the  universality  of  an 
information-theoretic  approach  can  be  used  to  determine  the  information  gains  from  new  features  or  alternative  models 
with  wide  ranging  impact  for  systems  that  rely  on  data  fusion  such  as  chemical  sensing,  machine  vision,  and  robotics. 

Information  theory  was  first  applied  to  the  problems  of  analytical  chemistry  in  the  1970s5.  The  field  of  chemometrics 
subsequently  developed  analysis  tools  for  multivariate  spectral  data6.  The  purpose  of  this  report  is  to  demonstrate  how  an 
information-theoretic  approach  to  the  analysis  of  multivariate  spectral  data  is  an  effective  strategy  for  addressing  the 
unique  challenges  of  chemical  detection  in  the  field.  The  report  is  organized  as  follows:  the  Background  section 
establishes why spectral-based sensing is considered the most relevant modality for next-generation chemical
detection.  The  Method  section  introduces  the  mathematical  framework  of  several  information  measures,  biasing 
schemes,  and  parameter  estimation.  The  Experiment  section  describes  the  spectral  data  sets,  performance  metrics,  and 
sensor  simulations.  Next,  results  and  discussion  are  presented,  followed  by  a  summary  of  conclusions. 

Figure 1 Examples of spectral data types: gas chromatography-mass spectrometry (GC-MS) total ion count, mass spectrometry
(MS), infrared spectroscopy (IR), and ion mobility spectrometry (IMS).

Figure 2 Comparison of mass (left) and IR (right) spectra with NIST and PNNL library spectra for 1,3,5-trimethylbenzene, respectively.

Background


Research efforts at NRL7-10 have focused on the development of effective data fusion techniques for multisensor systems
for  Navy  and  other  DoD  and  DHS  sensing  applications.  These  applications  often  require  the  integration  of  diverse 
sensing  methods  to  meet  coverage  and  performance  goals.  Targets  are  frequently  rare  events  and  potentially  catastrophic, 
which  adds  an  additional  challenge  for  hypothesis  tests  that  rely  on  prior  probabilities.  Backgrounds  are  often  uncertain 
and vary with the environment in which the system is placed. To be effective in field environments, then,
multisensor systems must be robust against these uncertain and potentially contaminating backgrounds while maintaining
efficient detection rates. The multivariate nature of spectral-based sensing systems (e.g., GC-MS, GC-IR) provides an
inherent protection against unknown interferants. In analytical chemistry, these types of instruments are called
second-order instruments as they can detect targets and maintain calibration even in the presence of unknown interfering
compounds11.

Figure  3  illustrates  the  robustness  present  in  infrared  spectroscopy,  for  example.  Here,  principal  component  analysis12 
(PCA)  was  performed  on  a  set  of  540  unique  IR  spectra.  PCA  is  a  standard  data  reduction  technique  in  chemometrics 
that  functions  by  projecting  high-dimensional  data  onto  a  lower-dimensional  subspace  that  maximally  describes  the 
sample-to-sample  variance  present  in  the  data.  The  cumulative  variance  represented  by  the  IR  spectra  -  one  measure  of 
their information content - reaches the 90% level with fewer than 50 of the 540 factors, which indicates a high degree of
redundancy  is  present  in  this  set  of  IR  spectra.  This  redundancy  gives  spectral  data  an  ability  to  represent  a  large  number 
of  chemical  compounds  with  unique  signatures  while  maintaining  robustness  to  unknown  interferants. 

Under DTRA sponsorship, NRL is actively working to bring the robustness of spectral sensors to fieldable multisensor
instruments through an information-theoretic approach to sensor data analysis and fusion, as well as by explicitly
incorporating hypothesis testing that combines independent evidence for, evidence against, and uncertainty in target
analyte detection.


Figure  3  (Left)  A  set  of  540  unique  IR  spectra.  (Right)  Cumulative  variance  (solid  curve)  represented  as  a  function  of  the 
number  of  factors  after  principal  components  analysis;  line  of  90%  of  variance  explained  (dotted  line). 

Method


Information  measures 

Information  measures  are  content-agnostic  tools  that  can  be  used  to  characterize  measurements  and  determine  their  most 
significant  components.  They  operate  on  probability  distributions  that  can  be  computed  directly  from  data,  or  from 
simulations  of  sensors,  or  from  statistical  models. 

Self-information  is  a  property  of  the  outcome  of  a  measurement.  Also  called  the  “surprisal”,  it  quantifies  how  unusual 
the outcome is. For a measurement x_0 sampled from a distribution p(x), the self-information is:

$$I(x_0) = -\log_2\left[p(x_0)\right] \qquad (1)$$

This gives a measure of how informative the outcome is. If p(x_0) = 1, then the result is deterministic, giving no
information: I(x_0) = 0. An outcome with a self-information of n bits gives the same information as 1 out of 2^n equally
probable events.

The Shannon entropy (H) is a property of the sensor’s distribution in the measurement space; it is the average
self-information:

$$H(X) = E_x\left[I(x)\right] = -\sum_{x \in X} p(x) \log_2\left[p(x)\right] \qquad (2)$$

It gives the average number of bits of information provided by a measurement sampled from p(x). The entropy H(X) = 0
when only one distinct outcome is possible (e.g., all samples are concentrated in one location in the measurement space),
and is maximized, H(X) = log2(n), when each outcome is equally probable (e.g., n samples are spread uniformly over
the  measurement  space).  Together  these  information  measures  can  be  used  to  quantify  how  useful  specific  sensing 
measurements  are;  for  example,  determining  which  m/z  bins  in  a  mass  spectrum  are  most  important  for  discriminating 
between  a  particular  sample  and  a  library  of  target  analyte  spectra  (self-information),  or  which  are  most  important  on 
average  for  discriminating  between  members  of  the  library  of  targets  and  typical  background  spectra  (entropy). 
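
As an illustration of Equations (1) and (2), the following minimal Python sketch computes self-information and Shannon entropy for small discrete distributions. The function names and the example distributions are assumptions made for illustration and do not come from the report.

```python
import math

def self_information(p):
    """Self-information (surprisal), in bits, of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

def shannon_entropy(dist):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    # Outcomes with zero probability contribute nothing (0 * log 0 is taken as 0).
    return sum(p * self_information(p) for p in dist if p > 0.0)

# Hypothetical distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]   # every outcome equally probable
peaked = [1.0, 0.0, 0.0, 0.0]        # only one outcome possible

h_uniform = shannon_entropy(uniform)  # equals log2(4)
h_peaked = shannon_entropy(peaked)    # equals 0
```

The uniform case reaches the maximum entropy log2(n), and the single-outcome case gives zero, matching the two limits described above.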


The Kullback-Leibler divergence (DKL) is a tool for comparing two distributions:

$$D_{KL}\left[p(x) \,\|\, q(x)\right] = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)} \qquad (3)$$


The Kullback-Leibler divergence is zero when p(x) = q(x) everywhere, and increases as q(x) becomes a worse
approximation to p(x). DKL is specifically useful for data fusion applications when p(x) is set to p(x|H1) and q(x) is set
to p(x|H0), where H0 and H1 are hypotheses being tested (e.g., “analyte X is present at or above concentration Y” versus
“background is present”). In this case, DKL measures the expected number of bits of information, per measurement, for
discriminating in favor of H1 when H1 is true. Since the end-goal of any analytical task is to differentiate between a set of
hypotheses, a statistical measure of the system’s capability to do just that is a robust, modality-independent predictor of
performance.
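
The Kullback-Leibler divergence of Equation (3) can be sketched for discrete distributions as follows. The distributions `p_target` and `p_background` are hypothetical stand-ins for p(x|H1) and p(x|H0), and the convention of returning infinity where q(x) = 0 but p(x) > 0 is the usual one rather than something stated in the report.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL[p || q] in bits for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf   # p is not absolutely continuous with respect to q
        total += pi * math.log2(pi / qi)
    return total

# Hypothetical three-bin response distributions under each hypothesis.
p_target = [0.1, 0.2, 0.7]       # stands in for p(x|H1)
p_background = [0.6, 0.3, 0.1]   # stands in for p(x|H0)

bits_for_h1 = kl_divergence(p_target, p_background)
bits_for_h0 = kl_divergence(p_background, p_target)
```

The two directions generally differ, which reflects that DKL measures discrimination in favor of one hypothesis when that hypothesis is true.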


The DKL is one of several possible divergence measures that can be constructed for the probability
distributions p(x) and q(x). An alternative is the Jensen-Shannon (JS) divergence, DJS:

$$D_{JS}\left[p(x) \,\|\, q(x)\right] = \frac{1}{2}\left(D_{KL}\left[p(x) \,\|\, m(x)\right] + D_{KL}\left[q(x) \,\|\, m(x)\right]\right) \qquad (4)$$
Here, m(x) = (p(x) + q(x))/2 is the average of the two distributions. This quantity is closely related to the
Kullback-Leibler divergence; if a measurement is drawn at random from p(x) or q(x), DJS[p(x)||q(x)] gives the expected
number of bits of information, per measurement, for deciding which distribution was chosen. If p(x) is set to p(x|H1)
and q(x) is set to p(x|H0), this gives the expected amount of information for discriminating between the hypotheses H0
and H1 when a sample is drawn at random from one of them. The JS divergence has the advantage that it is bounded both
above and below and remains well-defined even for regions of measurement space where the probability distributions
may be zero-valued (unlike DKL, which requires absolute continuity).
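
A short sketch of the Jensen-Shannon divergence for discrete distributions is given here to illustrate its bounds. The helper names are assumptions for illustration; with base-2 logarithms the divergence lies between 0 and 1 bit.

```python
import math

def kl_to_mixture(p, m):
    """D_KL[p || m] in bits, where m is a mixture that is nonzero wherever p is."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl_to_mixture(p, m) + kl_to_mixture(q, m))

# Distributions with no overlap at all: D_KL would be infinite,
# while the JS divergence reaches its upper bound of 1 bit.
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])

# Identical distributions give the lower bound of 0 bits.
identical = js_divergence([0.3, 0.7], [0.3, 0.7])
```

Because m(x) is nonzero wherever either p(x) or q(x) is nonzero, no division by zero occurs, which is the practical meaning of the well-definedness noted above.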


The pointwise mutual information, Ip, is a property of two outcomes of a measurement:

$$I_p(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)} = I(x) - I(x|y) \qquad (5)$$


The Ip is an information-theoretic generalization of the concept of statistical correlation. It computes the change in
the self-information of result x, given that result y is known (and vice-versa). If the two outcomes are uncorrelated (that
is, the outcome y does not affect the probability for x), then the Ip is equal to zero. Positive or negative values will be
taken if the two outcomes occur together more or less often than they would if they were uncorrelated, respectively. Ip
can be normalized by the joint self-information I(x,y) = -log2[p(x,y)], so that its range extends from -1 (i.e., outcomes
x and y never occur together) to +1 (i.e., outcomes x and y always occur together). Tests of correlation are important
tools for validating data fusion algorithms, as many popular methods assume statistical independence (e.g., naive Bayes,
Dempster-Shafer theory). By using a pointwise test, the regions of measurement space where statistical independence is
valid can be ascertained.
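
The pointwise mutual information of Equation (5) and its normalized form can be sketched as follows. The probabilities in the example are hypothetical, and returning the limiting value of -1 when the outcomes never co-occur is a convention assumed here.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits for outcomes x and y."""
    if p_xy == 0.0:
        return -math.inf   # the outcomes never occur together
    return math.log2(p_xy / (p_x * p_y))

def normalized_pmi(p_xy, p_x, p_y):
    """PMI divided by the joint self-information -log2 p(x, y); range [-1, +1]."""
    if p_xy == 0.0:
        return -1.0        # limiting value as p(x, y) approaches zero
    joint_info = -math.log2(p_xy)
    if joint_info == 0.0:
        raise ValueError("normalization is undefined when p(x, y) = 1")
    return pmi(p_xy, p_x, p_y) / joint_info

independent = pmi(0.06, 0.2, 0.3)                 # p(x, y) = p(x) p(y)
always_together = normalized_pmi(0.25, 0.25, 0.25)
never_together = normalized_pmi(0.0, 0.2, 0.3)
```

Evaluating such a quantity bin by bin is one way to locate the regions where an independence assumption holds approximately.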


Distance  metrics 

Four distance metrics were evaluated as features in this work: the cosine distance, Pearson’s correlation
distance, the cityblock or L1 distance, and the Euclidean or L2 distance. Note that the cosine metric is used in the NIST
spectral matching algorithm13. Table 1 summarizes the distance metrics.


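Table 1 - Formulas used for computing the distance between sample and candidate library spectra (p and q).

Distance metric    Formula
cosine             $1 - \frac{p \cdot q}{\|p\| \, \|q\|}$
correlation        $\frac{\mathrm{cov}(p, q)}{\sigma_p \sigma_q}$
cityblock (L1)     $\sum_n |p(x_n) - q(x_n)|$
Euclidean (L2)     $\left(\sum_n \left(p(x_n) - q(x_n)\right)^2\right)^{1/2}$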
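
The four metrics can be sketched in plain Python as follows. One assumption should be noted: the correlation distance is computed here as one minus Pearson's correlation coefficient, which is a common convention and may differ from the exact form used in this work.

```python
import math

def cosine_distance(p, q):
    """One minus the cosine of the angle between two spectra."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norms

def correlation_distance(p, q):
    """One minus Pearson's correlation coefficient (assumed convention)."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return 1.0 - cov / (sp * sq)

def cityblock_distance(p, q):
    """L1 distance: sum of absolute bin-wise differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean_distance(p, q):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical five-bin spectra.
sample = [0.0, 10.0, 0.0, 50.0, 100.0]
candidate = [5.0, 0.0, 0.0, 60.0, 90.0]
distances = {
    "cosine": cosine_distance(sample, candidate),
    "correlation": correlation_distance(sample, candidate),
    "cityblock": cityblock_distance(sample, candidate),
    "euclidean": euclidean_distance(sample, candidate),
}
```

The cosine and correlation distances are insensitive to overall intensity scale, whereas the L1 and L2 distances are not, which is one reason the metrics can carry different information.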

Bias  functions 

A  linear  bias  function  was  constructed  to  operate  on  mass  spectral  data  where  the  data  in  high  m/z  bins  were  weighted 
more strongly than low m/z bins. A weight parameter, w, was multiplied against the vector of m/z bin values [1, 2, 3, ...,
n]. The weight parameter thus acted as the slope of a line that could be varied to apply the desired amount of bias to
the intensity values of any given spectrum.

A  non-linear  bias  function  was  constructed  from  the  sigmoid  function,  a  standard  function  frequently  employed  for 
logistic  regression.  The  shifted  sigmoid  function 

[Equation (6): the shifted sigmoid function; the expression is illegible in the source.]


can be defined using a single parameter θ to weight all intensities equally, or as a vector θ that can weight the intensity in
each bin independently.
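
To make the scalar and vector parameterizations concrete, the sketch below uses one common shifted sigmoid, 2/(1 + exp(-θx)) - 1. This functional form is an assumption chosen for illustration and should not be read as the expression used in this work; intensities are assumed non-negative.

```python
import math

def shifted_sigmoid(x, theta):
    """Assumed form 2 / (1 + exp(-theta * x)) - 1; maps x >= 0 onto [0, 1)."""
    return 2.0 / (1.0 + math.exp(-theta * x)) - 1.0

def apply_sigmoid_bias(spectrum, theta):
    """Apply the sigmoid with one shared theta or with one theta per bin."""
    if isinstance(theta, (list, tuple)):
        thetas = list(theta)             # vector form: one weight per bin
    else:
        thetas = [theta] * len(spectrum)  # scalar form: same weight everywhere
    if len(thetas) != len(spectrum):
        raise ValueError("need exactly one theta per bin")
    return [shifted_sigmoid(v, t) for v, t in zip(spectrum, thetas)]

spectrum = [0.0, 100.0, 1000.0]                       # hypothetical intensities
shared = apply_sigmoid_bias(spectrum, 0.01)
per_bin = apply_sigmoid_bias(spectrum, [0.01, 0.02, 0.005])
```

A saturating function of this kind compresses large peak heights toward a common ceiling while leaving empty bins at zero.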


Parameter  estimation 

Information measures provide a means to estimate model parameters, for instance, optimizing values for denoising
spectral signals. A data set of 540 noised infrared
spectra was generated by adding Gaussian noise independently to each wavenumber bin of individual spectra. The noise
was generated using a mean of 0 and a standard deviation of 5% of the maximum peak height in any given spectrum.
Two approaches to denoising infrared spectra were implemented, and in both cases the JS divergence was used to locate
the optimal parameter value. One was thresholding; data below a selected threshold
were truncated to zero. Selection of the optimal threshold was a trade-off between removing noise (where a higher
threshold was better) and preserving spectral information (where a lower threshold was better).
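
The noise model and the thresholding step described above might be sketched as follows. The function names, the toy spectrum, and the chosen threshold are illustrative assumptions; the selection of the threshold by JS divergence is not shown.

```python
import random

def add_gaussian_noise(spectrum, fraction=0.05, rng=None):
    """Add zero-mean Gaussian noise to every bin independently.

    The standard deviation is `fraction` of the spectrum's maximum peak height.
    """
    rng = rng or random.Random()
    sigma = fraction * max(spectrum)
    return [v + rng.gauss(0.0, sigma) for v in spectrum]

def threshold_denoise(spectrum, threshold):
    """Truncate every value below the threshold to zero."""
    return [v if v >= threshold else 0.0 for v in spectrum]

rng = random.Random(42)                     # seeded for repeatability
clean = [0.0, 0.0, 0.8, 0.0, 0.0, 0.3, 0.0]  # hypothetical absorbance values
noisy = add_gaussian_noise(clean, fraction=0.05, rng=rng)
denoised = threshold_denoise(noisy, threshold=0.15)
```

Raising the threshold removes more noise but eventually removes small genuine peaks as well, which is the trade-off noted in the text.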

A low-pass filter was used as an alternative denoising method. A Fourier transform was applied to the noised infrared
spectra, and frequency data were truncated above a selected cutoff frequency. Selection of the optimal cutoff frequency was
again a trade-off between removing noise (where a lower frequency was better) and preserving spectral information
(where a higher frequency was better).
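
A minimal sketch of low-pass filtering by truncating Fourier coefficients follows. It uses a direct O(n^2) discrete Fourier transform so that it needs only the standard library; a production implementation would normally use an FFT routine. The test signal and cutoff are hypothetical.

```python
import cmath
import math

def dft(signal):
    """Direct discrete Fourier transform (O(n^2)); adequate for short signals."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def inverse_dft(coeffs):
    """Inverse transform, returning the real part of each sample."""
    n = len(coeffs)
    return [sum(coeffs[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)).real / n
            for k in range(n)]

def low_pass(signal, cutoff):
    """Zero every Fourier coefficient whose frequency index exceeds the cutoff."""
    n = len(signal)
    coeffs = dft(signal)
    for f in range(n):
        # Index f and its mirror n - f describe the same frequency of a real signal.
        if min(f, n - f) > cutoff:
            coeffs[f] = 0.0
    return inverse_dft(coeffs)

n = 32
slow = [math.cos(2 * math.pi * k / n) for k in range(n)]             # broad feature
fast = [0.5 * math.cos(2 * math.pi * 10 * k / n) for k in range(n)]  # rapid ripple
filtered = low_pass([s + r for s, r in zip(slow, fast)], cutoff=3)
```

In this toy case the rapid ripple is removed and the broad feature is recovered; with real spectra, a cutoff that is too low would also smooth away narrow absorption lines.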


Experiment 

Sources  of  large  sets  of  high-quality  spectral  data  are  available.  One  of  these  is  the  National  Institute  of  Standards  and 
Technology  (NIST)  08  mass  spectrometry  library,  a  collection  of  reference  electron-ionization  mass  spectra  (EI-MS)  of 
about  192,000  compounds.  Another  is  the  Department  of  Energy  /  Pacific  Northwest  National  Laboratory  (PNNL) 
Fourier-transform  infrared  (FTIR)  spectroscopy  library,  a  collection  of  high-resolution  infrared  spectra  from  more  than 
500 compounds including chemical agents, agent simulants, and common toxic industrial chemicals14,15. The mass
spectral data were provided at unit mass-to-charge (m/z) resolution with intensities in a range of 0 to 1000 counts.
Infrared spectra were provided at a resolution of 0.1 wavenumbers (cm^-1) over a range of approximately 500 to 4000
cm^-1 with intensities from 0.0 to 0.0569 absorption units. These two databases were cross-referenced to determine the
compounds  common  to  both  data  sets,  which  form  a  joint  database  (JDB)  library  of  540  unique  compounds. 

Several  test  data  sets  were  constructed  from  these  high-quality  data  sets  to  challenge  matching  algorithms  with  a  range  of 
different  noisy  and  uncertain  spectra  with  known  and  unknown  classifications.  One  test  set  (TS-JDB)  of  5500  mass 
spectra  was  constructed  from  the  JDB  library  of  540  selected  spectra.  It  consisted  of  three  sets  of  1000  randomly  selected 
example  spectra  corrupted  by  Gaussian  noise  at  levels  of  0.3,  0.5,  and  1.0  standard  deviations  as  scaled  by  the  maximum 
peak intensity. An additional 1000 randomly selected spectra were randomly permuted to generate spectra that were
nonphysical but preserved the entropy of the source spectra. Another 1000 spectra were constructed to mimic co-eluting
compounds  by  adding  two  randomly  selected  spectra  in  a  random  proportion.  Co-eluted  spectra  that  were  added  with  a 
proportion  of  0.3  or  less  were  considered  as  examples  of  the  more  prevalent  compound;  the  remaining  spectra  were 
classified  as  non-matchable.  Finally,  500  randomly  selected  and  unaltered  spectra  were  included  as  a  truth  set. 
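
The three corruption schemes used for TS-JDB can be sketched as below. The function names are hypothetical, and the labeling rule encodes one reading of the text: a mixture is assigned to the prevalent compound when the minor component contributes 0.3 or less, and is otherwise treated as non-matchable.

```python
import random

def add_noise(spectrum, level, rng):
    """Gaussian noise with standard deviation `level` times the maximum peak."""
    sigma = level * max(spectrum)
    return [v + rng.gauss(0.0, sigma) for v in spectrum]

def permute(spectrum, rng):
    """Shuffle bin positions; the set of intensities (and so the entropy) is kept."""
    shuffled = list(spectrum)
    rng.shuffle(shuffled)
    return shuffled

def coelute(a, b, fraction_b):
    """Mix two spectra; return the mixture and the label assigned to it."""
    mixture = [(1.0 - fraction_b) * x + fraction_b * y for x, y in zip(a, b)]
    if fraction_b <= 0.3:
        label = "a"      # b is the minor component
    elif fraction_b >= 0.7:
        label = "b"      # a is the minor component
    else:
        label = None     # treated as non-matchable
    return mixture, label

rng = random.Random(7)
spec_a = [0.0, 100.0, 0.0, 500.0, 1000.0]   # hypothetical spectra
spec_b = [1000.0, 0.0, 250.0, 0.0, 50.0]
noisy = add_noise(spec_a, 0.3, rng)
scrambled = permute(spec_a, rng)
mixture, label = coelute(spec_a, spec_b, rng.random())
```

Permutation leaves the multiset of intensities unchanged, so any entropy computed from the intensity values alone is the same before and after.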

Other  test  sets  (TS-NIST)  of  noisy  mass  spectra  were  constructed  by  adding  random  peaks  to  randomly  selected  spectra 
from  the  full  NIST  database  to  simulate  altered  relative  peak  intensities.  For  each  sample  spectrum,  between  1  and  20 
m/z  bins  were  randomly  selected,  and  their  abundance  values  were  increased  by  a  random  value  between  100  and  500 
(i.e.,  10%  -  50%  as  the  spectra  are  pre-scaled  to  a  maximum  abundance  of  1000  units).  Uniform  distributions  were  used 
for  generating  all  random  values. 
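
The random-peak corruption used for TS-NIST might look like the following sketch, assuming spectra stored as lists of abundances pre-scaled to a maximum of 1000. The function name is hypothetical, and no re-scaling is applied afterwards because the text does not describe one.

```python
import random

def add_random_peaks(spectrum, rng, min_peaks=1, max_peaks=20, low=100.0, high=500.0):
    """Raise between 1 and 20 randomly chosen bins by a uniform value in [100, 500]."""
    corrupted = list(spectrum)
    n_peaks = rng.randint(min_peaks, min(max_peaks, len(corrupted)))
    for index in rng.sample(range(len(corrupted)), n_peaks):
        corrupted[index] += rng.uniform(low, high)
    return corrupted

rng = random.Random(3)
empty = [0.0] * 50                    # hypothetical 50-bin spectrum with no peaks
corrupted = add_random_peaks(empty, rng)
changed = [v for v in corrupted if v != 0.0]
```

Because the bins are drawn without replacement, each selected bin is incremented exactly once.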

Test  sets  (TS-PNNL)  of  noisy  IR  spectra  were  prepared  from  sample  FTIR  spectra  by  adding  Gaussian  white  noise  to 
spectra  from  the  JDB  database.  Each  wavenumber  bin  was  shifted  by  an  independently  chosen  random  value  to  simulate 
the bin shifting frequently observed in IR spectra. Shift amounts were sampled from a normal distribution with zero
mean  and  standard  deviation  equal  to  5%  of  the  spectrum's  maximum  value. 

Correct  matches  (i.e.,  classifications)  served  as  the  metric  for  evaluating  algorithm  or  feature  performance.  Spectra  from 
a  test  set  were  compared  individually  to  candidate  spectra  chosen  from  a  subset  of  spectra  selected  from  either  of  the 
high-quality  MS  or  IR  data  sets.  Comparisons  were  achieved  using  a  distance  metric  or  probabilistic  measure.  The 
candidate  spectrum  with  the  smallest  distance  value  compared  to  the  test  set  spectrum  was  selected  as  the  “match”. 
Tallies  of  correct  and  incorrect  matches  were  used  to  generate  confusion  matrices  and  ROC  curves  to  summarize 
performance16.  The  area  under  the  ROC  curve  (AUC)  metric  was  also  used  in  performance  comparisons17. 
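
The matching and tallying procedure can be sketched as follows, using the cosine distance as the comparison. The three-compound library and test spectra are hypothetical, and the names `best_match` and `tally_matches` are illustrative.

```python
import math
from collections import Counter

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norms

def best_match(sample, library, distance=cosine_distance):
    """Return the library name whose spectrum has the smallest distance to the sample."""
    return min(library, key=lambda name: distance(sample, library[name]))

def tally_matches(test_set, library, distance=cosine_distance):
    """Count (true, predicted) pairs; the counts are the confusion-matrix entries."""
    return Counter((truth, best_match(s, library, distance)) for truth, s in test_set)

library = {"A": [100, 0, 20, 0], "B": [0, 80, 0, 40], "C": [10, 10, 90, 0]}
test_set = [("A", [90, 5, 25, 0]), ("B", [5, 70, 0, 50]), ("C", [0, 20, 100, 10])]

counts = tally_matches(test_set, library)
accuracy = sum(n for (truth, predicted), n in counts.items() if truth == predicted) / len(test_set)
```

Sweeping a threshold on the winning distance, rather than always accepting the nearest candidate, is what produces the points of a ROC curve.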

Additionally,  systems  of  one,  two,  and  three  univariate  sensors  were  simulated  using  Gaussian  sensor  response 
distributions  for  target  analyte  and  background  signal.  The  degree  of  overlap  was  varied  between  the  sensor  responses 
and  Gaussian  noise  was  added.  The  performance  of  the  sensor  systems  was  evaluated  using  both  AUC  and  information 
divergence  metrics. 
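
A single simulated univariate sensor can be sketched as below, with the AUC estimated from samples and the JS divergence obtained by numerical integration of the two Gaussian response distributions. The means, widths, sample counts, and grid size are arbitrary assumptions, and multi-sensor arrays are not covered.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def js_divergence_gaussians(mu_bg, mu_target, sigma, n_grid=4001):
    """Numerically integrate D_JS (bits) between two equal-width Gaussians."""
    lo = min(mu_bg, mu_target) - 8.0 * sigma
    hi = max(mu_bg, mu_target) + 8.0 * sigma
    dx = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        x = lo + i * dx
        p = gaussian_pdf(x, mu_target, sigma)
        q = gaussian_pdf(x, mu_bg, sigma)
        m = 0.5 * (p + q)
        if m <= 0.0:
            continue
        if p > 0.0:
            total += 0.5 * p * math.log2(p / m) * dx
        if q > 0.0:
            total += 0.5 * q * math.log2(q / m) * dx
    return total

def empirical_auc(target, background):
    """Probability that a target response exceeds a background response (ties count half)."""
    wins = sum((t > b) + 0.5 * (t == b) for t in target for b in background)
    return wins / (len(target) * len(background))

rng = random.Random(0)
background = [rng.gauss(0.0, 1.0) for _ in range(200)]
target = [rng.gauss(3.0, 1.0) for _ in range(200)]
auc = empirical_auc(target, background)
djs = js_divergence_gaussians(0.0, 3.0, 1.0)
```

Reducing the separation between the two means lowers both the AUC and the divergence, which is the kind of joint behavior examined in the results.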


Results  and  Discussion 


Self-information

Self-information  was  used  as  a  measure  to  quantify  how  informative  each  m/z  bin  is  in  a  mass  spectrum,  given  a  corpus 
of  mass  spectra  such  as  the  NIST  library.  Figure  4  shows  the  mass  spectrum  of  dimethoxy-benzamide,  along  with  the 
self-information  of  each  bin.  For  this  calculation,  the  m/z  bins  were  assumed  to  be  statistically  independent;  a  peak  at 
one  bin  did  not  affect  the  chances  of  a  peak  occurring  at  another  bin  (a  rather  stringent  assumption  that  is  further 
discussed  and  tested  below).  With  this  simplifying  assumption,  the  probability  of  a  peak  at  each  bin  was  straightforward 
to  compute  for  the  NIST  database.  While  the  qualitative  general  knowledge  that  higher  m/z  bins  are  more  informative 
was  reflected  here,  a  quantitative  interpretation  is  available  as  well.  With  nearly  192,000  spectra,  distinguishing  a 
spectrum in the NIST database takes (approximately) a minimum of log2(192,000) ≈ 17.5 bits of information, the
equivalent  of  one  bit  per  spectrum. 

In Figure 4, the highest self-information m/z bin (near 360) for dimethoxy-benzamide has only 5.3 bits. While that bin
provides the most information for distinguishing that chemical, it is not sufficient to uniquely identify the compound on
its own (5.3 ≪ 17.5). However, the self-information I(x) is extensive: under the independence assumption, the
contributions of separate bins add. Any feature-extraction/matching algorithm for
distinguishing this chemical from those in the NIST database must incorporate enough peaks so that the sum of their
self-information is above 17.5 bits. For example, the five highest peaks in self-information are sufficient to meet this criterion;
the five highest peaks in the spectrum (~13 bits) are not. Note that their locations in m/z differ. Thus, the most important
peaks  for  distinguishing  the  chemical  against  unknown  background  are  not  necessarily  those  of  the  most  common  mass 
fragments. 
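
The accounting above can be sketched in a few lines. The per-bin peak probability is estimated as the fraction of library spectra with a nonzero value in that bin, under the same independence assumption; the bit values passed to `peaks_needed` are hypothetical apart from the 5.3-bit maximum quoted in the text.

```python
import math

def bin_probabilities(library):
    """Fraction of library spectra that have a nonzero intensity in each bin."""
    n_spectra = len(library)
    n_bins = len(library[0])
    return [sum(1 for s in library if s[b] > 0) / n_spectra for b in range(n_bins)]

def peaks_needed(bits_per_peak, library_size):
    """Smallest number of peaks whose summed self-information reaches log2(library_size).

    Returns None when even all peaks together fall short.
    """
    required = math.log2(library_size)
    total = 0.0
    for count, bits in enumerate(sorted(bits_per_peak, reverse=True), start=1):
        total += bits
        if total >= required:
            return count
    return None

required_bits = math.log2(192000)   # about 17.55 bits
toy_library = [[1, 0], [1, 1], [0, 0], [1, 0]]
probabilities = bin_probabilities(toy_library)
n_peaks = peaks_needed([5.3, 4.0, 3.5, 3.0, 2.5, 1.0], 192000)
```

A bin that is populated in only a small fraction of the library carries many bits when a peak does appear there, which is why rare high-m/z fragments can outweigh the most intense peaks.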


Information  divergence  and  AUC 

Recent  results  using  simulated  sensor  data  have  demonstrated  a  connection  between  measures  of  information  divergence 
such as DKL and DJS and the area under a ROC curve (AUC) performance metric. The information divergence of a
sensor’s  measurements  is  a  statistical  property  of  the  sensor,  whereas  the  AUC  metric  is  an  empirical  property  describing 
the  sensor’s  performance  that  encompasses  those  measurements.  Figure  5  shows  a  logit  plot  of  the  AUC  against  DJS  for 
simulated  arrays  of  one,  two,  and  three  univariate  sensors  with  random  parameters  and  Gaussian  noise.  The  data  from  all 
three  array  sizes  collapse  onto  a  single  curve  -  so  a  lone  sensor  with  DJS  =  0.7  bits  will  have  approximately  the  same 
performance  as  an  array  of  sensors  with  DJS  =  0.7  bits.  That  is,  an  array  with  a  small  number  of  high-performing  sensors 
may outperform an array with more, but lower-performing, sensors. DJS provides a modality-independent measure for
comparing them. The logit scale is used to emphasize details in the region of high performance (AUC values between 0.95 and 1.0).

Figure 4 Mass spectrum (left) and self-information (right) of N-(2-Allylcarbamoyl-4-chloro-phenyl)-3,4-dimethoxy-benzamide (dimethoxy-benzamide). The five largest valued peaks are circled (red) in both plots.


Figure 5 Logit-logit plot of area under ROC curve (AUC) versus Jensen-Shannon divergence (DJS) for simulated 1-, 2-, and 3-sensor systems.


Feature  selection 

The Jensen-Shannon divergence, DJS, may similarly be used as a measure of the effectiveness of a feature-extraction
method  when  working  with  high-dimensional  spectral  data.  Different  features  can  be  quantitatively  compared  based  on 
the  Jensen-Shannon  divergence  of  the  distributions  they  create.  In  addition,  the  free  parameters  of  a  feature  extraction 
algorithm  can  be  optimally  selected  by  maximizing  DJS.  Several  features  based  on  distance  metrics  and  modifications  of 
distance metrics were evaluated for mass spectral data. Simulated “sample” spectra from test set TS-NIST were
compared to matching and nonmatching NIST library spectra using these features. From these data, smooth distributions
of  the  distances  between  sample  and  matching  library  spectra,  and  between  sample  and  nonmatching  library  spectra, 
were  built  using  Gaussian  kernel  density  estimation.  Rather  than  selecting  the  next  nearest  matching  spectra  for  the 
nonmatching  distance,  the  nonmatching  library  spectra  were  selected  at  random  from  the  list  of  all  nonmatching 
chemicals for each sample spectrum separately. The Jensen-Shannon divergence between these two distributions was
then  computed  and  used  to  evaluate  the  performance  gain  (or  loss)  with  that  feature. 
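
The evaluation of a feature by kernel density estimation followed by a JS divergence can be sketched as follows. The bandwidth rule (a Silverman-style rule of thumb applied to each sample set), the evaluation grid, and the two sets of distances are all assumptions for illustration; the report does not state the bandwidth used.

```python
import math
import statistics

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def rule_of_thumb_bandwidth(samples):
    # Assumed rule of thumb; requires at least two distinct sample values.
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-0.2)

def js_between_samples(a, b, n_grid=512):
    """JS divergence (bits) between smoothed distributions of two sample sets."""
    bw_a, bw_b = rule_of_thumb_bandwidth(a), rule_of_thumb_bandwidth(b)
    pad = 3.0 * max(bw_a, bw_b)
    lo, hi = min(a + b) - pad, max(a + b) + pad
    grid = [lo + i * (hi - lo) / (n_grid - 1) for i in range(n_grid)]
    kde_a, kde_b = gaussian_kde(a, bw_a), gaussian_kde(b, bw_b)
    p = [kde_a(x) for x in grid]
    q = [kde_b(x) for x in grid]
    sum_p, sum_q = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / sum_p, qi / sum_q   # normalize on the grid
        mi = 0.5 * (pi + qi)
        if pi > 0.0 and mi > 0.0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0.0 and mi > 0.0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total

# Hypothetical distances: sample-to-match values are small, sample-to-nonmatch large.
matching = [0.05 + 0.01 * i for i in range(10)]
nonmatching = [0.85 + 0.01 * i for i in range(10)]
feature_score = js_between_samples(matching, nonmatching)
```

A feature whose matching and nonmatching distance distributions barely overlap scores close to 1 bit, while a feature that cannot separate them scores close to 0.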

Different  distance  metrics  capture  different  information  about  spectral  data.  Figure  6  shows  one  measure  of  information 
obtained  from  a  set  of  540  mass  spectra  and  the  distance  metrics  given  in  Table  1.  Points  in  these  scatter  plots  represent 
the  separation  between  pairs  of  different  spectra  as  computed  with  different  distance  metrics;  for  example,  Figure  6 
(upper  left),  with  correlation  distance  (y-axis)  and  cosine  distance  (x-axis).  Distance  metrics  that  provide  similar 
information  will  generate  points  in  such  a  scatter  that  lie  along  the  diagonal  bisection  line,  as  with  the  correlation  and 
cosine  metrics  in  Figure  6  (upper  left).  Metrics  that  measure  more  complementary  information  will  scatter  farther  from 
this  line;  for  example,  Figure  6  (upper  right,  lower  left,  and  lower  right). 
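
For reference, the four metric names used in this section can be written out as follows. The definitions are the common conventions for these names and are assumed here; Table 1 may define them with different normalizations.

```python
import math


def cityblock(u, v):
    """L1 norm of the difference."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """L2 norm of the difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def correlation(u, v):
    """Cosine distance after subtracting each vector's mean."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


# Hypothetical five-bin spectra, for illustration only.
u = [0.0, 10.0, 0.0, 100.0, 5.0]
v = [0.0, 12.0, 0.0, 90.0, 8.0]
for metric in (cityblock, euclidean, cosine, correlation):
    print(f"{metric.__name__:<12} {metric(u, v):.5f}")
```

Under these definitions the correlation distance differs from the cosine distance only by mean subtraction. For sparse spectra the mean is small relative to the peaks, which is one plausible reason the two metrics fall close to the diagonal in Figure 6 (upper left).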


Figure 6 Differences in correlations for different distance metrics, as labeled. The amount of scatter is a measure of
complementary information. (Recoverable axis label: Euclidean (L2) distance.)


Linear  bias 

The  first  set  of  feature  extraction  methods  tested  involved  application  of  a  linear  weight  to  the  MS  data  before  comparing 
them  with  a  distance  metric.  Since  higher-m/z  bins  are  more  informative  by  proximity  to  the  molecular  ion,  it  was 
reasoned  that  biasing  the  analysis  toward  them  could  improve  matching  performance.  Each  bin  was  multiplied  by  a 
linear  weight  w  =  (1  +  ax),  where  x  was  the  m/z  value  of  the  bin  and  a  was  the  scaling  factor  to  be  varied.  Figure  7 
(left)  shows  the  Jensen-Shannon  divergence  (on  a  logit  scale)  for  the  four  distance  metrics  of  Table  1  as  a  function  of  the 
linear scaling factor a. In the a ≈ 0 case (far left), distances were measured between nearly unweighted spectra. As the
scaling  factor  was  increased,  DJS  decreased  and  then  saturated  for  each  distance  metric.  The  relative  performance  of  the 
different  metrics  remained  approximately  unchanged.  This  result  suggests,  then,  that  the  linear  bias  method  would 
actually  decrease  performance. 
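
The weighting step itself is simple to write down. The sketch below applies w = (1 + ax) bin by bin and then compares two spectra; the spectra, the m/z values, and the choice of cosine distance are illustrative assumptions, and `linear_weight` is a hypothetical helper name.

```python
import math


def linear_weight(spectrum, mz, a):
    """Multiply each bin by w = 1 + a*x, where x is the bin's m/z value."""
    return [(1.0 + a * x) * y for x, y in zip(mz, spectrum)]


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical five-bin spectra (abundances) at these m/z values.
mz = [15, 29, 43, 57, 71]
sample = [20.0, 100.0, 60.0, 10.0, 5.0]
library = [25.0, 100.0, 50.0, 12.0, 2.0]

for a in (0.0, 0.01, 1.0):
    d = cosine_distance(linear_weight(sample, mz, a), linear_weight(library, mz, a))
    print(f"a = {a:<5} cosine distance = {d:.5f}")
```

With a = 0 every weight equals 1 and the unweighted distance is recovered. For large a the weight is approximately ax, so further increases only rescale the spectrum, which is consistent with the saturation of DJS reported above.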

Non-linear filtering

The second set of methods tested involved the application of a nonlinear sigmoid (logistic) filter. Since real spectra can
show significant variability in peak height (see, for example, Figure 2), it was reasoned that suppressing those
differences with a logistic function could improve matching performance. The abundance of each bin in the spectrum
was rescaled by the sigmoid function σ(x, θ) of eq. (6). Here x was the spectral abundance vector and θ was the
variable scale factor. The sigmoid was shifted and scaled from the standard logistic function, so that when θ ≪ 1 the
distance metrics of the unweighted spectra would be recovered (up to a constant factor, which did not affect DJS values).
Figure 7 (right) shows the Jensen-Shannon divergence (on a logit scale) for the four distance metrics as a function of the
scale factor θ. These results are considerably more interesting than the linear weight case. As θ increased from a very
small value, DJS decreased for all metrics, reaching a minimum near θ ≈ 10⁻² before increasing. For higher values
of θ, the divergences of the cityblock and euclidean metrics (i.e., the L1 and L2 norms) were substantially better than for
the unweighted data case. However, the two metrics with the best performance for unweighted data (correlation and
cosine distances) saw no improvement.
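
One filter with the stated properties is sketched below. Eq. (6) is not reproduced in this section, so the exact form is an assumption: the standard logistic function shifted and scaled to f(x) = 2 / (1 + exp(-θx)) - 1, which equals tanh(θx / 2). The scale factor is written `theta`, the abundances are hypothetical values on a 0 to 1 scale, and `sigmoid_filter` is a hypothetical helper name.

```python
import math


def sigmoid_filter(spectrum, theta):
    """Rescale abundances with a shifted, scaled logistic function.

    ASSUMED form: f(x) = 2 / (1 + exp(-theta * x)) - 1 = tanh(theta * x / 2).
    Then f(0) = 0, f(x) is close to theta * x / 2 when theta * x is small,
    and f(x) approaches 1 for any nonzero peak when theta is large.
    """
    return [math.tanh(0.5 * theta * x) for x in spectrum]


def euclidean(u, v):
    """L2 norm of the difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical abundances, for illustration only.
sample = [0.0, 0.05, 1.00, 0.30]
library = [0.0, 0.10, 1.00, 0.20]

for theta in (0.01, 1.0, 1000.0):
    filtered = sigmoid_filter(sample, theta)
    d = euclidean(filtered, sigmoid_filter(library, theta))
    print(f"theta = {theta:<7} filtered = {[round(y, 4) for y in filtered]} distance = {d:.5f}")
```

In the small-theta limit the filter is linear, so every distance is the unweighted distance multiplied by a constant. In the large-theta limit every nonzero bin maps to 1, so only peak positions, not peak heights, contribute to the distance.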


Figure 7 Logit plot of DJS versus linear (left) and sigmoid (right) scale factors for four distance metrics.


The JS divergence can be used to show the effects of non-linear filtering for each distance metric separately. Figure 8
shows the separation distributions for matched spectra (blue) versus randomly selected mismatches (green) for four
distance metrics when the scaling factor θ = 0. The distributions are histograms of the distance values computed with
each metric. Less overlap between the distributions indicates better distinguishability of spectra and a commensurate
increase in DJS. Figure 9 shows the separation distributions for scaling factor θ = 1000. The distributions have generally
narrowed and shifted to the left after application of the sigmoid filter.

Figure 10 shows the effects of the sigmoid filter on the correlation and euclidean distance metrics in scatter plots for
three values of parameter θ: 0.01, 1, and 1000 for matched and unmatched spectra. As the scaling parameter increased,
the two distributions changed shape; at low values (far left) they overlapped, and at high values (far right) a clear
separation can be seen in one direction, but not the other. Figure 11 revisits the aforementioned link between AUC and
DJS, using the results of the linear-weighting and sigmoid filter experiments. For both experiments, the data from all four
distance metrics approximately collapse onto a single curve. A logit-logit scale is used to show that this relationship
extends to extremely high performances in the range of AUC > 0.95.
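
The AUC side of this comparison can be estimated directly from the two sets of distances, without building the ROC curve. The sketch below uses the rank (Mann-Whitney) form of the AUC and the logit transform used for the axes; the distance values are hypothetical, and treating a smaller distance as a better match is an assumption about how the detector is thresholded.

```python
import math


def auc(match, nonmatch):
    """Probability that a matching distance is smaller than a nonmatching one.

    Ties count as one half. This equals the area under the ROC curve of a
    detector that declares a match when the distance falls below a threshold.
    """
    wins = 0.0
    for d_match in match:
        for d_non in nonmatch:
            if d_match < d_non:
                wins += 1.0
            elif d_match == d_non:
                wins += 0.5
    return wins / (len(match) * len(nonmatch))


def logit(p):
    """Log-odds transform, which spreads out values close to 0 or 1."""
    return math.log(p / (1.0 - p))


# Hypothetical distance values, for illustration only.
match = [0.1, 0.2, 0.3, 0.4]
nonmatch = [0.35, 0.6, 0.7, 0.9]

area = auc(match, nonmatch)
print(f"AUC = {area:.4f}, logit(AUC) = {logit(area):.3f}")
```

Of the 16 pairs, only the pair (0.4, 0.35) is ordered the wrong way, so the estimate is 15/16. The logit transform is what makes differences between values such as 0.99 and 0.999 visible on a plot.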




Figure 8 Scaling factor θ = 0: separation between matched (blue) and mismatched (green) spectral distributions for four
distance metrics.


Figure 9 Scaling factor θ = 1000: separation between matched (blue) and mismatched (green) spectral distributions for four
distance metrics.


Figure 10 Correlation vs euclidean distance for matched spectra (blue) and unmatched spectra (green) for three different scale
factors.


Figure 11 Logit-logit plot of area under ROC curve (AUC) versus DJS for linear and sigmoid features. (Legend: cosine,
euclidean, correlation, and cityblock metrics, each shown for the linear and the sigmoid feature.)

Parameter estimation for optimal denoising

The  noisy  IR  test  data  set,  TS-PNNL,  was  used  to  test  the  modality-independence  of  performing  feature-extraction 
optimization  with  the  information  measure  DJS.  Distances  were  calculated  after  intensities  in  all  bins  below  a  variable 
threshold  were  set  to  zero  (i.e.,  hard  thresholding).  Figure  12  illustrates  the  thresholding  procedure  for  a  sample  noised 
IR  spectrum.  The  position  of  the  threshold  represented  a  clear  optimization  problem:  a  threshold  set  too  low  let  in  most 
of  the  noise,  which  then  broadened  the  matching  and  nonmatching  distributions  and  decreased  performance.  A  threshold 
set  too  high  degraded  performance  by  discarding  too  much  useful  distinguishing  information.  Figures  13  and  14  show 
changes  in  the  separation  distance  between  the  distributions  before  (Figure  13)  and  after  (Figure  14)  thresholding  with  a 
given  value.  Figures  13  and  14  again  demonstrate  that  different  distance  metrics  measure  different  features  of 
multispectral data. The optimal threshold value was determined using the Jensen-Shannon divergence. Figure 15 shows
DJS  as  a  function  of  threshold  value.  A  clear  maximum  value  can  be  determined  for  each  curve.  These  values  represent 
the  best  performance  over  all  spectra  for  the  given  distance  metrics. 
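
Hard thresholding and the parameter search can be sketched as below. Expressing the threshold as a fraction of the highest peak is an assumption, as are the helper names `hard_threshold` and `best_parameter`. The scores in the usage example are made-up placeholders standing in for DJS values; they are not results from this study.

```python
def hard_threshold(spectrum, fraction):
    """Set every bin below `fraction` of the highest peak to zero."""
    cutoff = fraction * max(spectrum)
    return [y if y >= cutoff else 0.0 for y in spectrum]


def best_parameter(candidates, score):
    """Return the candidate value for which `score` is largest.

    `score` would be a function that applies the feature with the given
    parameter, builds the matching and nonmatching distance distributions,
    and returns their Jensen-Shannon divergence.
    """
    return max(candidates, key=score)


# Hypothetical noisy spectrum: two real peaks plus low-level noise.
noisy = [0.02, 0.90, 0.05, 0.01, 0.40, 0.03]
denoised = hard_threshold(noisy, 0.1)
print(denoised)

# Placeholder scores (NOT measured values) with a single interior maximum.
placeholder_djs = {0.0: 0.20, 0.1: 0.70, 0.5: 0.40, 0.9: 0.10}
print(best_parameter(placeholder_djs, placeholder_djs.get))
```

A threshold of zero leaves the spectrum unchanged, while a threshold near one keeps only the highest peak, which mirrors the two failure modes described above.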

The same noisy IR test data set was used for the alternative denoising method employing a low pass filter. In this
application, a Fourier transform was applied to the noised infrared spectra. A low pass filter was implemented by
discarding frequency components above a selected cutoff frequency. Figure 16 shows application of a given low pass
filter to a sample noised IR spectrum. Selection of the optimal cutoff frequency was again a trade-off between removing
noise (where a lower cutoff was better) and preserving spectral information (where a higher cutoff was better). Again,
the optimal cutoff frequency was determinable from the Jensen-Shannon divergence. Figure 17 shows DJS as a function
of cutoff frequency for the four distance metrics. Again, the maximum values can be determined for each curve and
represent the best performance over all spectra for the given distance metric.
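
A low pass filter of this kind can be illustrated with a plain discrete Fourier transform. The sketch below uses a naive O(n²) transform so that it needs no external libraries, and a synthetic signal in place of an IR spectrum; a production implementation would normally use an FFT routine instead. The helper names and the integer frequency-index cutoff are assumptions.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    """Inverse transform, returning the real part of each sample."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def low_pass(signal, cutoff):
    """Zero every Fourier component whose frequency index exceeds `cutoff`."""
    n = len(signal)
    coeffs = dft(signal)
    for k in range(n):
        freq = min(k, n - k)  # indices k and n - k carry the same frequency
        if freq > cutoff:
            coeffs[k] = 0.0
    return idft(coeffs)


# Synthetic example: a slow component (index 2) plus a fast one (index 20).
n = 64
slow = [math.sin(2.0 * math.pi * 2 * t / n) for t in range(n)]
fast = [0.3 * math.sin(2.0 * math.pi * 20 * t / n) for t in range(n)]
noisy = [s + f for s, f in zip(slow, fast)]

smoothed = low_pass(noisy, cutoff=5)
worst = max(abs(a - b) for a, b in zip(smoothed, slow))
print(f"largest deviation from the slow component: {worst:.2e}")
```

Lowering `cutoff` below 2 would remove the slow component as well, which is the loss of spectral information that the optimization has to balance against noise removal.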


Figure 12 Application of thresholding: (left) noised IR spectrum (curve), threshold (dashed line), (right) IR spectrum after
denoising. (Horizontal axes: wavenumber, cm⁻¹.)


Figure 13 Distributions of matched (blue) and mismatched (green) noised IR spectra before thresholding for four distance
metrics.


Figure 14 Distributions of matched (blue) and mismatched (green) noised IR spectra after thresholding for four distance
metrics.


Figure 15 DJS for selected thresholds over noised IR data for four different distance metrics.


Figure 16 Application of low pass filter in Fourier space: (left) Fourier transformed noised IR spectrum (curve), cutoff
frequency (dashed line), (right) IR spectrum after denoising with low pass filter.


Pointwise  mutual  information 

Finally,  Figure  18  shows  the  normalized  pointwise  mutual  information  computed  for  the  full  NIST  mass  spectral 
database.  The  points  in  the  thermal  plot  represent  how  the  likelihood  of  a  peak  at  one  m/z  bin  affects  the  likelihood  of  a 
peak  at  another  bin.  If  the  two  bins  are  uncorrelated,  the  value  is  near  zero;  points  on  the  diagonal  are  unity,  by 
definition. Dark blue areas along the x and y axes are likely artifacts of the low peak count when m/z < 50. The high
mutual information, especially in the upper-right corner, suggests that the independent-bin approximation may be a poor
one. That is, the presence of one peak at a high m/z bin strongly increases the likelihood of another high m/z peak, since
one high-mass molecule is likely to have several high m/z peaks. Figure 18 also shows strong correlations parallel to the
diagonal, spaced about Δm/z = 14 units apart. This is due to a common fragmentation pattern observed in electron
ionization  mass  spectrometry  involving  loss  of  successive  methylene  subunits  (-CH2-)  of  larger  molecules.  While  these 
correlation  results  are  not  exactly  novel,  they  do  demonstrate  the  utility  of  pointwise  mutual  information  for  elucidating 
unknown  structure  in  a  data  set. 
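
The quantity plotted can be sketched as follows, assuming the common normalization of pointwise mutual information by -log p(i, j), and assuming each spectrum has been reduced to a 0/1 indicator of whether a peak is present in each m/z bin. Both are assumptions; the toy spectra and the helper names `npmi` and `npmi_matrix` are hypothetical.

```python
import math


def npmi(p_i, p_j, p_ij):
    """Normalized pointwise mutual information of two peak events."""
    if p_ij == 0.0:
        return -1.0  # limiting value: the two peaks never occur together
    if p_ij == 1.0:
        return math.nan  # undefined: both peaks occur in every spectrum
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


def npmi_matrix(peaks):
    """NPMI for every pair of bins, given 0/1 peak indicators per spectrum."""
    n_spectra = len(peaks)
    n_bins = len(peaks[0])
    prob = [sum(s[i] for s in peaks) / n_spectra for i in range(n_bins)]
    matrix = []
    for i in range(n_bins):
        row = []
        for j in range(n_bins):
            joint = sum(1 for s in peaks if s[i] and s[j]) / n_spectra
            row.append(npmi(prob[i], prob[j], joint))
        matrix.append(row)
    return matrix


# Toy data: four spectra, three m/z bins (1 = peak present).
peaks = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]
for row in npmi_matrix(peaks):
    print([round(value, 3) for value in row])
```

With this normalization a bin paired with itself gives 1, independent bins give 0, and bins that never share a peak give -1, which matches the behavior described for the diagonal and for uncorrelated bins.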


Figure 17 DJS for selected cutoff frequencies over noised IR data for four different distance metrics.


Figure 18 Correlations in the NIST data computed using pointwise mutual information.


Conclusions


Chemical  sensing  data  such  as  those  from  spectral  instruments  are  typically  sparse  and  characterized  by  a  small  number 
of large-valued peaks. Traditional statistical methods that focus on characterizing the most frequent measurements using,
for  example,  a  mean  or  standard  deviation  for  modeling  data,  have  not  performed  well  with  sparse  spectral  data.  Further, 
relative  peak  heights  and  peak  locations  in  spectral  data  can  vary  significantly  as  they  depend  strongly  on  the  input 
analyte  concentration  as  well  as  the  measurement  process.  Algorithms  that  match  sample  spectra  to  a  library  of  candidate 
targets  for  chemical  identification  must  be  robust  with  respect  to  these  characteristics  of  chemical  sensing  data.  Distance 
metrics  -  by  definition  -  treat  each  element  of  a  spectrum  as  equally  informative  and  thus  are  highly  susceptible  to 
variations  in  relative  peak  heights  and  locations.  While  an  improvement  over  traditional  statistical  modeling,  distance 
metrics  have  met  with  only  limited  success  in  large-scale  autonomous  identification  of  chemicals  from  sensor  data. 

An  information-theoretic  approach  that  focuses  on  differentiating  significant  from  insignificant  data  in  chemical  spectra 
(i.e., relevant signal from irrelevant background) was shown to be a potentially effective alternative to these methods.
Information  measures  form  an  effective  set  of  tools  for  analyzing  and  characterizing  spectral  data  and  multisensor 
systems:  individual  measurements  with  self-information,  typical  measurements  with  entropy,  correlations  with  mutual 
information,  and  relative  performance  with  divergence.  Much  of  their  utility  arises  from  their  independence  of  the  actual 
information  content;  modeling  how  chemical  sensors  generate  informative  data  is  not  required,  but  could  potentially 
yield  additional  information  for  distinguishing  chemicals. 

Information  measures  also  provide  a  means  to  quantify  correlations  and  performance  gains  for  different  types  of  data, 
features,  and  fusion  algorithms.  Divergence  metrics  are  directly  relatable  to  ROC  curves  and  the  AUC  performance 
metric, and thus are able to measure the complementarity of sensor responses and data features. Binary, univariate, and
multivariate  sensors  can  all  be  modeled  within  this  framework,  making  it  well  suited  for  addressing  the  unique 
challenges  of  next  generation  chemical  detection  in  the  field. 